Connected Corp. Backup Server Architecture
wsmith, 22 September 1999
This is my current understanding of Connected Corporation’s backup server architecture. I’m not trying to do any analysis here, just to write down what they’ve told us about their system. We have had one teleconference and one in-person visit with them so far.
Introduction
Connected’s service makes backups of PCs over the network. Backups are done periodically (usually daily) by uploading incremental changes to the server. Restores are either done “live” over the net or by burning a CD-ROM.
Their main business is currently in the enterprise, where a company signs up to have Connected do the backups over the Internet. I believe they also have a few customers who run their own internal data centers using Connected’s technology. They have an offering for individual PCs, but they are not currently marketing actively to consumers, so there are only a few thousand consumer customers.
To get the necessary scalability, reliability, and low-bandwidth operation for their service, Connected uses these techniques:
User-based server partitioning: enables easy linear parallelism
Optimized data transfer: only the changed portions of files are sent to the server
Single instance store equivalent: shared files are stored only once in the archive
Archive clustering: data on tape is carefully arranged to minimize tape handling
Server pairs: each user’s state is mirrored on two geographically separated servers
Here are some statistics for Connected’s current operation:
12,800 users, 8,000 of whom are “active”
5,900GB (compressed) total data
104GB of data shared between multiple users, about 25% of which is shared by only two users
2-3Mbps Internet bandwidth (95th percentile usage)
Usage is fairly evenly distributed throughout the day (the peak hour carries 8% of the day’s usage)
Data flow is about 90% inward (backups), 10% outward (restores + web site)
4MB data transferred in the average online restore
About eight CD-ROM restores per day
These numbers are skewed by a few high-volume corporate users, so they don’t have very good data on pure consumer usage.
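A few derived figures follow directly from the statistics above. The calculation below is only back-of-the-envelope arithmetic. It assumes decimal units (1GB = 1000MB), weights every user equally (which the skew just mentioned makes doubtful), and reads the 8% peak-hour figure as a share of daily usage.

```python
# Back-of-the-envelope figures derived from the statistics quoted above.
# Assumptions (not from Connected): decimal units, all users weighted equally.

TOTAL_USERS = 12_800
ACTIVE_USERS = 8_000
TOTAL_DATA_GB = 5_900        # compressed
SHARED_DATA_GB = 104
PEAK_HOUR_FRACTION = 0.08    # share of daily usage in the busiest hour

avg_mb_per_user = TOTAL_DATA_GB * 1000 / TOTAL_USERS
avg_mb_per_active_user = TOTAL_DATA_GB * 1000 / ACTIVE_USERS
shared_fraction = SHARED_DATA_GB / TOTAL_DATA_GB

# A perfectly flat day would put 1/24 of the usage in every hour.
peak_to_flat_ratio = PEAK_HOUR_FRACTION / (1 / 24)

print(f"average stored per registered user: {avg_mb_per_user:.0f} MB")
print(f"average stored per active user:     {avg_mb_per_active_user:.0f} MB")
print(f"shared data as share of archive:    {shared_fraction:.1%}")
print(f"peak hour relative to a flat day:   {peak_to_flat_ratio:.2f}x")
```

On these assumptions the archive holds roughly 460MB per registered user, shared data is under 2% of the total, and the busiest hour carries a little under twice the load of a perfectly flat day.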
Topology
The basic unit of the server topology is a mirrored pair of backup servers. Each user is assigned permanently to one server pair (migration is theoretically possible, but expensive due to the amount of data that would have to be transferred between the servers). Each backup server is associated with a SQL Server database used to store the bookkeeping information about its users’ backups and an “archive cache”, which is a RAID array used as a staging area for archived data.
Server pairs are grouped into clusters. Each cluster has two mirrored registration servers (SQL Server) that maintain user account information for all the users on the cluster. A cluster also has two or more HSM servers that operate automated tape library systems. Connected wrote their own tape management software.
A cluster is separated geographically into two mirrored halves. Each half-cluster contains a registration server, at least one tape library, and one server from each backup server pair. A high-bandwidth pipe connects the half-clusters together. (Connected currently uses six T1 lines for this.)
[Figure: Mirrored pair of backup servers]
[Figure: Data center cluster]
This topology is hypothetical to some degree, since Connected has not yet outgrown their first backup server pair. At present, they run each half-cluster on a single Proliant server (dual P3-450, 512K RAM) with a 150GB RAID array and 9TB (?) StorageTek tape library.
Backup lifecycle
I will go through the normal sequence of events and then discuss the details of some interesting parts of the system.
Registration
A new user begins by using the Connected client to register for the service. The client connects to the backup cluster and goes through a registration protocol with the cluster’s registration server. The registration server assigns the client to a backup server pair and records the account information in its database. The client records the identity of both its backup servers.
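The registration step can be sketched roughly as follows. Connected did not tell us how the registration server chooses a pair, so the least-loaded policy here is purely an assumption, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ServerPair:
    name: str
    servers: Tuple[str, str]                  # one server in each half-cluster
    users: Set[str] = field(default_factory=set)


class RegistrationServer:
    """Toy registration server: assigns each new user to one server pair."""

    def __init__(self, pairs: List[ServerPair]) -> None:
        self.pairs = list(pairs)
        self.accounts: Dict[str, ServerPair] = {}

    def register(self, user_id: str) -> Tuple[str, str]:
        # Assignment is permanent, so a repeat registration returns the same pair.
        if user_id in self.accounts:
            return self.accounts[user_id].servers
        # Assumed policy: pick the pair with the fewest users.
        pair = min(self.pairs, key=lambda p: len(p.users))
        pair.users.add(user_id)
        self.accounts[user_id] = pair
        # The client records the identity of both servers in the pair.
        return pair.servers
```

The returned tuple is what the client would store locally so that it can reach either member of its pair later.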
Backup
A backup can be triggered manually, at shutdown, opportunistically in the background, or on a scheduled basis. The schedule settings allow some leeway so the clients can randomize their start times to avoid load peaks on the hour. The typical client does a backup about 18 times a month.
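The randomized start could look something like the sketch below. We don't know how much leeway the client allows or what distribution it uses; a uniform offset within a configurable window is an assumption, and pick_start_time is a hypothetical name.

```python
import random
from datetime import datetime, timedelta


def pick_start_time(scheduled: datetime, leeway_minutes: float, rng=random) -> datetime:
    """Return a start time uniformly spread over the leeway window.

    Spreading clients across the window avoids everyone connecting
    exactly at the scheduled time.
    """
    offset_seconds = rng.uniform(0, leeway_minutes * 60)
    return scheduled + timedelta(seconds=offset_seconds)


# Example: a 2:00 AM schedule with an hour of leeway.
start = pick_start_time(datetime(1999, 9, 22, 2, 0), leeway_minutes=60)
```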
The client starts a backup by connecting to one of its two backup servers. They perform an initial handshake in which the client authenticates itself and the server sends down any upgraded client components. The server also notifies the client of any file versions that have been expired from the backup library and thus can no longer be restored.
Client and server then establish a “session”. A session is a set of complete file updates transferred in one connection. The client doesn’t have to finish sending every update it has in one session, although it tries. If a session is interrupted, the next one starts fresh; each session is independent of previous sessions. However, if the client does succeed in sending a full update in one session, that session is considered “safe” for purposes of a full restore. At the end of each session, the backup server adds an activity record to the registration server’s database.
The client maintains a database of the state of the disk at the completion of the last session, so all file comparisons can take place locally. The client and server must agree on the contents of the database and the identity of the last session the server received. For various reasons, the client may have old or corrupted state, or no state at all in a total rebuild situation. It may also be that the server was unable to replicate and the client is connecting to the other server. If they don’t agree, the client database is reconstructed from the server. (The client could also check whether the other server has better state; I’m not sure if this happens.)
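The agreement check might be sketched like this. It is a simplification under stated assumptions: the types and field names are hypothetical, and comparing the whole file table is only for illustration (a real client would presumably compare a session identifier plus some digest of the database).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ClientState:
    last_session_id: int
    files: Dict[str, str]        # path -> truename as of the last session


@dataclass
class ServerView:
    last_session_id: int
    files: Dict[str, str]


def reconcile(client: Optional[ClientState], server: ServerView) -> ClientState:
    """Return the state the client should use for its next session.

    client is None when there is no local state at all (total rebuild).
    """
    in_sync = (
        client is not None
        and client.last_session_id == server.last_session_id
        and client.files == server.files
    )
    if in_sync:
        return client
    # Disagreement: rebuild the client database from the server's record.
    return ClientState(server.last_session_id, dict(server.files))
```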
The client scans the disk, comparing modification dates with those stored in the local database and looking for new or deleted files. New and changed files are delta-compressed and sent to the server (more about that in a separate section), and deletions are noted. As each change is sent, the server adds it to the archive and acknowledges the change, at which point the client updates its database.
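A minimal sketch of the scan, assuming the local database is just a mapping from path to the modification time recorded at the last session (the real database holds more, as described under delta compression). The function name is hypothetical.

```python
import os
from typing import Dict, List, Tuple


def scan_changes(root: str, known: Dict[str, float]) -> Tuple[List[str], List[str], List[str]]:
    """Compare the disk under root with the recorded state.

    known maps path -> modification time at the last session.
    Returns (new, changed, deleted) path lists.
    """
    seen: Dict[str, float] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                seen[path] = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it this pass
    new = sorted(p for p in seen if p not in known)
    changed = sorted(p for p in seen if p in known and seen[p] != known[p])
    deleted = sorted(p for p in known if p not in seen)
    return new, changed, deleted
```

Only the new and changed lists need any file data read; deletions are just noted.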
Changes are stored on the archive cache RAID array in “archive sets”, which are the unit of transfer to and from tape. Each archive set contains changes from one session, but the size of an archive set is limited to 5MB to ensure good tape performance, so a large session may result in multiple archive sets.
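Splitting a session into archive sets can be sketched as a simple greedy packing. Two assumptions here: changes are kept in session order, and a single change larger than the limit gets an oversized set of its own (we weren't told how that case is handled).

```python
from typing import List, Tuple

MAX_ARCHIVE_SET_BYTES = 5 * 1024 * 1024   # the 5MB limit quoted above

Change = Tuple[str, int]                  # (file name, size in bytes)


def pack_archive_sets(changes: List[Change],
                      limit: int = MAX_ARCHIVE_SET_BYTES) -> List[List[Change]]:
    """Group one session's changes into archive sets no larger than limit."""
    sets: List[List[Change]] = []
    current: List[Change] = []
    current_size = 0
    for name, size in changes:
        # Close the current set if this change would push it over the limit.
        if current and current_size + size > limit:
            sets.append(current)
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        sets.append(current)
    return sets
```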
In the case of an unexpected disconnection, the server retains the state of the current file transfer so the client can reconnect and continue from where it left off. I believe a new session would be started since sessions contain only complete file changes.
If a file is open when the scan reaches it, the file is copied twice and the copies are compared. If they are identical, the file is assumed to be stable, and it is backed up from the copy. If the file can’t be opened or is unstable, the client will continue trying to get a stable copy of it at intervals so that it can eventually be backed up. The registry files, which can never be opened, are handled specially (probably using scanreg.exe).
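The copy-twice check might look like the sketch below. The function name and the fixed copy file names are hypothetical, and the retry schedule for unstable files is left to the caller.

```python
import filecmp
import os
import shutil
from typing import Optional


def stable_copy(path: str, workdir: str) -> Optional[str]:
    """Copy the file twice and compare the copies.

    Returns the path of a copy to back up from, or None if the file
    could not be read or changed between the two copies.
    """
    first = os.path.join(workdir, "copy1")
    second = os.path.join(workdir, "copy2")
    try:
        shutil.copyfile(path, first)
        shutil.copyfile(path, second)
    except OSError:
        return None  # could not be opened; the caller should retry later
    # shallow=False forces a byte-by-byte comparison of the two copies.
    if filecmp.cmp(first, second, shallow=False):
        return first
    return None
```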
The servers in a backup pair replicate user data to each other to maintain a perfect mirror. If the servers are disconnected, the client can still connect to either one of them, and they will sort it all out later during the replication process. We didn’t discuss the mirroring process in detail.
Migration to tape
When the archive cache becomes sufficiently full, the HSM server copies new archive sets to the tape library. Of course, while tapes have a respectable data transfer rate, their seek time is abysmal, so the tape library can become a major bottleneck in the restore process. Connected addresses this by being careful about putting related archive sets near each other on tape. They say they have put a lot of engineering effort into this problem.
We don’t have the details of their algorithm, but they say each tape contains data from a limited number of users. Presumably they try to keep all of a user’s recent data on the same tape. There is a “compaction” process that coalesces old deltas so a restore has fewer archive sets to retrieve. There may also be a process that rearranges archive sets on tape (again keeping related archive sets close), but this is unclear.
Shared files (see the section on delta compression) are stored in their own set of tapes, separate from all the user-specific files. Many of these files are shared by only two clients, and Connected will probably make a separate pool for them.
Restore
Because the client has a full local record of the contents of the server, the entire process of selecting files and planning the restore operations happens without any server involvement. Connected has a fairly complicated UI for choosing what to do in a selective restore. For a total disk rebuild (which Connected calls “repair” as opposed to “restore”), the process is automatic.
Once the needed files are known, the client connects to a backup server and sends the list of files it needs. The server uses its database to determine which archive sets must be retrieved, and requests them from the HSM server, which begins retrieving them from the tape library and depositing them on the archive cache RAID array. As the archive sets arrive in the cache, the server reassembles the full files from the deltas in the archives and sends them to the client. Finally, the backup server adds an activity record to the registration server’s database.
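The server-side lookup can be sketched as follows, assuming the database can be viewed as a catalog mapping each file to the archive sets that hold its base version and deltas. That shape, and the names used, are guesses for illustration only.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def plan_restore(requested: Iterable[str],
                 catalog: Dict[str, List[Hashable]],
                 cache: Set[Hashable]) -> Tuple[List[Hashable], List[Hashable]]:
    """Work out which archive sets a restore needs.

    catalog maps path -> archive set ids (base first, then deltas).
    cache holds the ids already on the archive cache RAID array.
    Returns (from_cache, from_tape), each in first-needed order.
    """
    needed: List[Hashable] = []
    seen: Set[Hashable] = set()
    for path in requested:
        for set_id in catalog[path]:
            if set_id not in seen:      # an archive set is fetched only once
                seen.add(set_id)
                needed.append(set_id)
    from_cache = [s for s in needed if s in cache]
    from_tape = [s for s in needed if s not in cache]
    return from_cache, from_tape
```

Only the from_tape list would be handed to the HSM server; anything already in the cache can be reassembled immediately.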
The latency of the restore operation can vary quite a bit. The most important variable is whether the tape library needs to be involved, and if so, how many tapes must be accessed to get all the data. A recently-archived file will still be in the archive cache, so restoring it is very fast. A file on tape with one delta might take two minutes to retrieve, while one with many deltas could take 15 minutes. The worst case is something like an Outlook PST file, which is huge, long-lived, and changes constantly.
As an alternative to the live online restore process for large restores or repairs, Connected can burn a CD-ROM containing the archived files and restore software. Standalone PCs with CD-R drives are used to produce the CD-ROMs. The process is essentially the same, except that the standalone PC rather than the backup server gathers the archive sets from the HSM server and assembles them into files. Because CD-ROM restores are allowed to take 24 hours to complete, the HSM server gives online restore operations higher priority.
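One simple way to get that prioritization is a two-level priority queue, sketched below. This is not Connected's tape management software, just an illustration; the class and constant names are hypothetical.

```python
import heapq
import itertools

ONLINE, CDROM = 0, 1     # lower value is served first


class RetrievalQueue:
    """Archive set requests: online restores before CD-ROM restores, FIFO within each."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()   # tie-breaker preserves arrival order

    def submit(self, kind: int, archive_set_id: str) -> None:
        heapq.heappush(self._heap, (kind, next(self._counter), archive_set_id))

    def next_request(self):
        kind, _seq, archive_set_id = heapq.heappop(self._heap)
        return kind, archive_set_id

    def __len__(self) -> int:
        return len(self._heap)
```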
So few CD-ROMs are currently needed that Connected has not automated anything but the actual CD build process. An operator manually inserts a CD-R blank into the drive and selects a pending restore from the queue. When the CD finishes, the operator removes it and labels it for delivery.
Delta compression
The key to low-bandwidth backups is compressing the changes as much as possible. Connected is very proud of their compression scheme.
As mentioned earlier, the client maintains a database of the last state of the disk that was sent to the server. For each file, the database contains the “truename”, last-modified date, and size of the file. The truename is a hash of the entire contents of the file, using a proprietary hashing algorithm. There is also a set of “block hashes”, which are 8-byte hashes of 512-byte blocks of the file.
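The two kinds of hash can be illustrated as below. Connected's hash algorithm is proprietary, so SHA-256 and BLAKE2b are stand-ins chosen only because they ship with Python; the block size and the 8-byte block hash length are the figures quoted above.

```python
import hashlib
from typing import List

BLOCK_SIZE = 512         # bytes per block
BLOCK_HASH_BYTES = 8     # bytes per block hash


def truename(data: bytes) -> str:
    """Hash of the entire file contents (SHA-256 as a stand-in)."""
    return hashlib.sha256(data).hexdigest()


def block_hashes(data: bytes) -> List[bytes]:
    """8-byte hash of each 512-byte block (BLAKE2b as a stand-in).

    The final block may be shorter than BLOCK_SIZE.
    """
    return [
        hashlib.blake2b(data[i:i + BLOCK_SIZE], digest_size=BLOCK_HASH_BYTES).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]
```

At 8 bytes per 512-byte block, the block hashes add about 1.6% of the file size to the client database.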
To transfer a new or changed file, the client first sends the truename of the file. If the server already has a file with the same truename, no data needs to be transferred; the server just makes a “link” to the existing file. (Presumably it will be migrated to the shared pool the first time this happens.) Otherwise, the client generates a new set of block hashes and “diffs” them with the old hashes, taking into account shifts as well as changes. Being able to shift contents around works well for things like PST files that have been compacted. The client sends a “delta”, which is the truename of the old file plus the block edit list that generates the new file.
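A simplified sketch of building and applying a block edit list follows. It is not Connected's algorithm: the hash is a stand-in, the operation format is invented for illustration, and hashing a window at every byte offset is far slower than the rolling checksum a tool like rsync uses. It does show how matching blocks at arbitrary offsets copes with shifted content.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 512


def _h(block: bytes) -> bytes:
    return hashlib.blake2b(block, digest_size=8).digest()   # stand-in hash


def block_hashes(data: bytes) -> List[bytes]:
    return [_h(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]


def make_delta(old_hashes: List[bytes], new_data: bytes) -> List[Tuple[str, object]]:
    """Client side: needs only the old block hashes, not the old file."""
    index: Dict[bytes, int] = {}
    for i, h in enumerate(old_hashes):
        index.setdefault(h, i)
    ops: List[Tuple[str, object]] = []
    literal = bytearray()
    pos = 0
    while pos < len(new_data):
        block = new_data[pos:pos + BLOCK_SIZE]
        # Only full-size windows are candidates for a match.
        match = index.get(_h(block)) if len(block) == BLOCK_SIZE else None
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))       # reuse old block number `match`
            pos += BLOCK_SIZE
        else:
            literal.append(new_data[pos])     # no match here: slide by one byte
            pos += 1
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_delta(old_data: bytes, ops: List[Tuple[str, object]]) -> bytes:
    """Server side: rebuild the new file from the old file and the edit list."""
    out = bytearray()
    for kind, arg in ops:
        if kind == "copy":
            out += old_data[arg * BLOCK_SIZE:(arg + 1) * BLOCK_SIZE]
        else:
            out += arg
    return bytes(out)
```

With an 8-byte hash a false block match is possible in principle, so one would expect the truename of the reassembled file to be checked as a safeguard; we did not confirm that Connected does this.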
Consider the case where a file has been changed, then renamed or moved. It’s not obvious to the backup scan that this is the same file, but it would be inefficient to send the whole thing when deltas would suffice. To handle this, the client scans the database looking for close matches to the file. If it finds one, it uses that as the basis of the delta (sending that file’s truename with the block edit list). Making this operation fast was apparently one of the driving requirements of Connected’s proprietary hash algorithm.
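We don't know how Connected measures a "close match". One plausible stand-in, sketched below, scores each file in the database by the fraction of the new file's block hashes it already contains. The threshold, the function name, and the database shape are all assumptions, and this version compares aligned blocks only, so it would miss a file whose contents had merely shifted.

```python
from typing import Dict, Hashable, List, Optional


def best_basis(new_hashes: List[Hashable],
               database: Dict[str, List[Hashable]],
               min_overlap: float = 0.5) -> Optional[str]:
    """Return the truename of the best delta basis, or None if nothing is close.

    database maps truename -> block hashes of that file.
    """
    new_set = set(new_hashes)
    if not new_set:
        return None
    best_name, best_score = None, 0.0
    for name, hashes in database.items():
        # Fraction of the new file's distinct blocks already present in this file.
        score = len(new_set & set(hashes)) / len(new_set)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None
```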
Encryption
For clarity, I haven’t mentioned it earlier, but all user data is encrypted before it leaves the client machine, and stays encrypted throughout its entire lifetime until it reaches the client machine again. Even the data on a restore CD-ROM is encrypted. Each account has a fixed key that is set by the user (or it can be randomly generated). Each file (or delta; I’m not sure which) has an encryption key that is stored encrypted by the account key. The customer can choose 40-bit, 56-bit, or triple-DES encryption.
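The two-level key arrangement can be illustrated with the toy below. It is emphatically not secure and not what Connected uses: Python's standard library has no DES, so a hash-derived XOR keystream stands in for the real cipher purely to show the structure (a per-file key encrypts the data, and the account key encrypts the per-file key). All names are hypothetical.

```python
import hashlib
import secrets
from typing import Tuple


def _keystream(key: bytes, length: int) -> bytes:
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """TOY ONLY: XOR with a keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def encrypt_file(account_key: bytes, plaintext: bytes) -> Tuple[bytes, bytes]:
    file_key = secrets.token_bytes(16)                 # fresh key per file
    wrapped_key = toy_cipher(account_key, file_key)    # stored alongside the data
    return wrapped_key, toy_cipher(file_key, plaintext)


def decrypt_file(account_key: bytes, wrapped_key: bytes, ciphertext: bytes) -> bytes:
    file_key = toy_cipher(account_key, wrapped_key)
    return toy_cipher(file_key, ciphertext)
```

Whoever holds the account key can unwrap every file key, which is why the account key is the thing that matters for escrow.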
Since the delta compression algorithm is based on hashes generated by the client, the server doesn’t need to see the actual data to make it work. Thus, it might seem that Connected never needs to see the account keys. However, the shared file optimization fails if the server doesn’t have the keys. The hash is sufficient to do the backup operation (figuring out that user B has already stored user A’s file on the server), but the restore operation has to produce the file’s contents encrypted with user A’s key, which is impossible unless the server has both users’ keys.
For that reason, Connected generally escrows account keys. They allow corporate customers to opt out of escrow if they insist, but few do. SOHO customers are not given the option. This is not only for storage optimization, but also for support reasons: if a SOHO customer loses the account key, there is no way to restore their data without key escrow.
Connected claims to have appropriate mechanisms in place to ensure security of keys. Key administration is done through a system restricted to privileged operators that maintains an audit trail. Corporate customers do their own key administration using a remote tool (actually a website, I think). Several corporate customers have performed security audits on Connected.
Scalability
We asked Connected to estimate server requirements when scaling their system up to a larger user base. They estimated that a single server pair could support about 15,000 users, but cautioned that this number was somewhat fuzzy since they haven’t maxed out one pair yet, and their customer base is rather different from the consumer-oriented service we’re investigating. They think one tape library can probably support three backup servers.
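Taking those fuzzy estimates at face value gives a rough sizing formula. The sketch below assumes a single cluster, 15,000 users per pair, one tape library per three backup servers, and at least one library in each half-cluster; the function name is hypothetical and the results should be read as order-of-magnitude only.

```python
import math
from typing import Dict

USERS_PER_PAIR = 15_000            # Connected's rough estimate
SERVERS_PER_TAPE_LIBRARY = 3       # also Connected's rough estimate


def estimate_hardware(users: int) -> Dict[str, int]:
    pairs = math.ceil(users / USERS_PER_PAIR)
    # Each half-cluster holds one server from every pair and needs
    # its own tape libraries (at least one).
    libraries_per_half = max(1, math.ceil(pairs / SERVERS_PER_TAPE_LIBRARY))
    return {
        "server_pairs": pairs,
        "backup_servers": 2 * pairs,
        "tape_libraries": 2 * libraries_per_half,
    }


print(estimate_hardware(1_000_000))
```

For a million users this works out to 67 server pairs (134 backup servers) and 46 tape libraries.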
Connected thinks the current bottleneck on the server is the RAID array. They are switching RAID vendors to eliminate this and expect the next bottleneck to be the SQL Server that stores the archive data. Their CPU usage is around 20% now, so they don’t see that as a problem anytime soon.
Because each user is tied to a particular server, it seems that the server topology should scale to any size by simply adding more machines (similar to Hotmail’s “capitalization units”). It should also be straightforward to divide the service geographically by clusters (or even within clusters: one corporate client divides each backup pair between California and Massachusetts).
Operations
Thanks to the tape robots, most daily operations are automated. Manual processes include CD-ROM burning (see above), SQL Server backup, and mechanical maintenance of the tape library. End-to-end tests are done continuously to alert the staff to any problems: two machines do backups and restores of a test file over Connected’s LAN as well as through modem connections to unrelated ISPs. Connected estimates one staff person could handle operation of several backup servers. Again, they haven’t grown large enough to have an accurate staffing estimate.
Connected has two data centers a few miles apart. Each data center connects to the Internet through a different ISP using three T1 lines, and they are connected together using six T1 lines.
Business issues
Connected’s current offering for SOHO users is unlimited usage for $19.95/month. The fine print allows them to kick off outliers such as the grad student who was uploading 1GB/day of weather maps. However, they currently have no automated tools to detect such things; they found the grad student by manually investigating where all the space was going.
They are introducing a new service that attempts to back up only data files, using an inclusion list of file extensions. Because the volume will be less, they expect to be able to offer this service for more like $7/month.
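An inclusion list amounts to a filter like the one below. The extensions shown are made up for illustration; we were not given Connected's actual list.

```python
import os

# Hypothetical inclusion list; Connected's real list is unknown to us.
INCLUDED_EXTENSIONS = {".doc", ".xls", ".ppt", ".txt", ".pst"}


def is_data_file(path: str, included=INCLUDED_EXTENSIONS) -> bool:
    """True if the file's extension (case-insensitive) is on the inclusion list."""
    return os.path.splitext(path)[1].lower() in included
```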
Many of Connected’s competitors charge by the megabyte rather than using a flat rate. Connected tried this, but found they had too many support calls from people who didn’t understand their bills. Other flat-rate competitors have some sort of quota enforcement on the amount of data transferred or even the pre-transfer size, but Connected has relied on dealing with individual heavy users rather than introducing a quota system.